Bridging Learning and Planning: The Internal Model
AI029 Lesson 8
00:00

This lecture establishes the conceptual bridge between direct reinforcement learning and planning by introducing the Internal Model. We define the model as any mechanism that mimics the environment's behavior, allowing the agent to predict future states and rewards, a process often called system identification.

[Diagram: real experience drives both model learning (experience → model) and direct RL (experience → value/policy); the model generates simulated experience for planning (model → value/policy).]
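A minimal sketch of such an internal model, assuming a deterministic tabular environment (the class name and methods below are illustrative, not from the lecture):

```python
import random

class TabularModel:
    """Internal model for a deterministic, tabular environment:
    it mimics the environment by replaying observed transitions."""

    def __init__(self, seed=0):
        self.transitions = {}  # (state, action) -> (reward, next_state)
        self.rng = random.Random(seed)

    def update(self, state, action, reward, next_state):
        # Model learning: record what the real environment did.
        self.transitions[(state, action)] = (reward, next_state)

    def sample(self):
        # Simulated experience: pick a previously observed state-action
        # pair and predict the reward and next state that followed it.
        state, action = self.rng.choice(list(self.transitions))
        reward, next_state = self.transitions[(state, action)]
        return state, action, reward, next_state
```

Planning then amounts to repeatedly calling `sample()` and feeding the result to the same backup rule used on real experience.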

The Frozen Lake Analogy

Imagine a self-driving car learning to navigate a frozen lake. Direct RL occurs when the car actually drives, slips on ice, and receives a negative reward, immediately updating its value function. Planning occurs while the car is parked; it uses its Internal Model (a mental map of where the ice was) to simulate thousands of hypothetical turns, updating its policy without ever moving a tire or risking a collision.

Core Insights

  • Indirect RL: Also known as planning, this uses real experience to improve a model; the model then generates simulated experience on which the agent performs the same value updates as direct RL.
  • The Internal Model as a Simulator: In tabular methods, the model typically just records each observed transition and reward; planning then "samples" from this stored history as if it were fresh experience.
  • Algorithmic Unity: Learning and planning are fundamentally identical in their mathematical execution. They both use reinforcement learning backup algorithms (like Q-learning or Sarsa); the only difference is the source of the experience (real vs. simulated).
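The three insights above can be combined into a tabular Dyna-Q-style loop. This is a hedged sketch, not the lecture's code: the toy `step` environment (a four-state "frozen corridor" echoing the lake analogy), the state/action counts, and the hyperparameters are all assumptions. Note that `backup` is the single Q-learning update applied to both real and simulated experience.

```python
import random

def step(s, a):
    """Hypothetical deterministic 'frozen corridor': states 0..3,
    action 1 slides right, action 0 stays; reaching state 3 pays +1."""
    s_next = min(s + 1, 3) if a == 1 else s
    reward = 1.0 if s_next == 3 else 0.0
    return reward, s_next, s_next == 3  # reward, next state, done

def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]  # Q[state][action]
    model = {}                          # (s, a) -> (reward, s_next, done)

    def backup(s, a, r, s2, done):
        # Identical Q-learning backup for real AND simulated experience.
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda b: Q[s][b])
            r, s2, done = step(s, a)
            backup(s, a, r, s2, done)       # direct RL (real experience)
            model[(s, a)] = (r, s2, done)   # model learning
            for _ in range(planning_steps): # planning (simulated experience)
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```

The planning inner loop performs many extra backups "while the car is parked", so the value of the goal propagates back through the corridor far faster than real steps alone would allow.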